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We study the general problem of mixing for ab-initio quantum-mechanical problems. 
Guided by general mathematical principles and the underlying physics, we propose a mul- 
tisecant form of Broyden's second method for solving the self-consistent field equations of 
Kohn-Sham density functional theory. The algorithm is robust, requires relatively little fine- 
tuning and appears to outperform the current state of the art, converging for cases that 
defeat many other methods. We compare our technique to the conventional methods for 
problems ranging from simple to nearly pathological. 
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I. INTRODUCTION 



We consider the problem of determining the electron density p that satisfies the self-consistent 
field equations according to the Kohn-Sham density functional theorjU^^l 

{Ho + Vp)(l)i = eict)i (I.la) 
p{x) = 5^ (l + e^^^'-'^))"' (I-lb) 

i 

Here Hq is the single-particle noninteracting Hamiltonian and Vp is an effective potential param- 
eterized by the particle density p. The constant /? is \/kT where k is Boltzmann's constant and 
T is temperature. The term (1 -|- e'^'^'^'"'^'')""'^ is the Fermi-Dirac occupation and the constant p is 
determined by / p{x)dx = N for an A^-body problem. Following"^ we let Hp := Hq + AVjJ^ denote 
the Kohn-Sham Hamiltonian and reformulate the above system of equations as a nonlinear fixed 
point problem: find p such that 

F{p){x) := (l + e^^"^'^'^y\x,x) = p{x) (1.2) 

where p is the unique solution to iV = trace{{\ + e^'^^p^^^^ We refer to the operator F above 
as the self-consistent field (SCF) operator. We will not be concerned with the details of the SCF 
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operator or its approximations since these tend to be specific to the appHcation. Also, we will work 
with the discretized version of the SCF operator, which we will call the SCF mapping since it is 
a real vector-valued mapping of the discretized density. Throughout this work, however, we will 
point to instances where the form of this mapping can cause problems for numerical procedures 



for solving Eq.(1.2) 



Numerical algorithms for solving Eq.(1.2) abound - the representative examples we focus on 
here ar c * * * * * * * I These are iterative procedures and the process of determining the desired 
density p from previous estimates has come to be known as "mixing" in the physical literature. For 
ab-initio methods there is frequently a user-provided mixing term which, if it is improperly chosen, 
will lead to divergence of the iterations. In many cases the user has to learn by failure what is the 
correct value to use, expending a fair amount of computer resources in the process. We will show 
that many of the methods found in the physical literature have counterparts in the mathematical 
literature thus providing a more systematic approach to the choice of algorithm parameters. The 
goal of this work is the development of a method that does not require expert user input, is fast, 
and can handle many of the more complicated and poorly convergent problems such as metallic 
surfaces or heterostructures that can defeat a novice and sometimes an expert. 

In the next section we provide detailed background both to the mathematical literature on 
methods for solving non-linear equations, as well the physical literature on mixing. There are two 
major "branches" of numerical techniques distinguished by the space in which they are derived. 
However we show that all of the different methods are generated by solving an optimization prob- 



lem of the same form. In Section III we detail our proposed algorithm for solving Eq.(1.2). We 
show in Section III that most algorithms are predicated upon fixed point mappings with a great 
deal of regularity, in particular monotone mappings. Motivated by the prospect that most systems 
of physical interest do not lead to convex SCF mappings, the principal insight that yields our 
improvements is to treat the prior steps as random samples in a higher-dimensional space rather 
than deterministic steps in a path to a fixed point. In effect, the Kohn-Sham Hamiltonian eval- 
uated at an electron density far from equilibrium can vary in a semirandom and perhaps chaotic 
fashion. This viewpoint leads to a natural interpretation of the updates in terms of predicted and 
unpredicted components. Controls on the algorithm are exerted through parameter choices that 
affect primarily the unpredicted step. The rest of the section involves technical considerations for 
controlling step sizes and safeguarding against instabilities. We introduce the idea of using proxi- 
mal mappingJI21 to account for problems associated with near- linear dependencies and, at the same 
time, to safeguard against uncertainties in the model analogous to a classic Wiener filter. This 
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regularization has the advantage that it also acts similarly to a classic trust region technique in 



numerical optimization. In Section IV we present numerical results for both very easy problems as 
well as semi-pathological cases. The new approach outperforms existing algorithms in most cases, 
and does significantly better with poorly constructed Kohn-Sham mappings. The algorithm is also 
relatively insensitive to user input. We conclude with a discussion of some of the open issues. 

II. ITERATIVE METHODS FOR SOLVING NONLINEAR EQUATIONS 

We wish to determine the electron density p^, with fixed atom locations. The density is a 
real- valued vector with k elements. With an estimated density pn at the n-th step of an iterative 
procedure for determining p*, we check whether our estimate satisfies the ab-initio self-consistent 



field (SCF) equations given by Eq.(1.2). Evaluation of the SCF mapping returns a modified density 
p'^ := F{pn), another real- valued vector with k elements. The density we seek is a fixed point of 
F, i.e., we solve the system of non-linear equations 

F{p,)-p, = 0. (II.l) 

This suggests the usual Newton algorithm as a possible numerical solution strategy. 

A. Newton's Method 



Expressing the mapping in Eq.(II.l) by its Taylor series expansion centered on a fixed point p^ 
yields 

F{p) -p = {J{p,) - I){p - p,) + 0{\p - pA'') (II.2) 

where J{p*) is the Jacobian of F (supposing this is well-defined) at p^, and / is the identity 
mapping. The Jacobian J is a k-hy-k real- valued matrix. Given a point p„, Newton's method 
generates the next point Pn+i by 



Pn+l 



Pn - {J{Pn) - irHHpn) " Pn)- (H-S) 



Under standard assumptions, this iteration can be shown to converge quadratically in a neigh- 
borhood of a local solution^^. The computational cost of calculating the Jacobian and inverting is, 
however, prohibitive for high-dimensional problems such as density functional calculations. 
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B. Matrix Secant Methods 



A classical approach to avoid computing and inverting the Jacobian is via solutions to the 
matrix secant equation: ( J(p„) — /) ~ Bn where 

BniPn - Pn-l) = [{F{pn) " Pn) " " Pn-l)) (11.4) 

Introducing new variables, a conventional reduction, this is compactly represented as 

(11.5) 



or 



where Hn = ^ and 



BnSn-l — Un-l 



HnUn-l — Sn-1 



Sn-l=Pn-Pn-l and yn-1 = {F{pn) - Pn) - {F{pn-l) - Pn-l)- 

The next density pn+i is then generated either by the recursion 

Pn+l = Pn- B~^{F{pn) - Pn) 



where Bn satisfies Eq.(II.5), or by 



Pn+l = Pn - Hn{F{pn) - Pn) 



(11.6) 



(II.7) 



(11.8) 



(11.9) 



where Hn satisfies Eq.(II.6). Sequences generated by Eq.(II.8) or Eq.(II.9) are known as quasi- 



Newton methods. The unknowns in Eq.(II.5) and Eq.(II.6) are the matrices Bn and Hn, hence 



these are systems of k linear equations in k'^ unknowns. There are infinitely many solutions to the 
matrix secant equation and each different solution leads to a different numerical method for finding 
the fixed point F. A common approach in the literature on density functional calculations, due 
to SrivastavsP, is based on the Broyden rank one matrix updates Other common updates in 
the optimization literature are the symmetric rank one (SRI) and the BEGS updated Our focus 
in this study is on improvements in the context of Broyden updates, however the basic principles 
outlined here extend more generally to other matrix secant methods. 



C. Rank One Updates 

A new matrix Bn+i is obtained by updating in some fashion Bn using the new data pair (s^, Un) 
combined with the prior information (sq, Uq), (si, yi), (sn-i, yn-i) subject to the constraint that 
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satisfy Eq.(II.5). In his original paper Broydeii^ looked at two approaches. The first, 



normally referred to as Broyden's "good" method (GB), finds the nearest matrix to Bn with 
respect to the Frobenius norm (i.e. the square root of the sum of squares of the matrix entries) 
that solves Eq.(II.5) (this property, due to Dennis and MorJ^is slightly different than Broyden's 



original interpretation). This is given explicitly by 

{Vn - PnBnSn)Sn 



B. 



n+l 



Bn + 



(11.10) 



/3n 1 1 1 1 ^ 

Here and throughout this work the norm ||s|| = V s^s is the Euclidean norm and a vector (un- 
derstood to be a column vector) or matrix raised to the power indicates the transpose. The 
scalar /?„ > is a step size parameter which has been retained for formal consistency although 
it is normally taken to be unity. This method has a simple recursion for its inverse (needed in 



Eq.(II.8)): 



B 



-1 

n+l 



{PnSn B^ S^B^ 



(11.11) 



(s'^Bn^yn) 

Broyden's first method is shown iri^ (Theorem 5.2) to converge locally superlinearly under the 
standard assumptions that the Jacobian exists, is nonsingular, and Lipschitz continuous at the 
solution. 

The second method proposed by Broyden has the attractive feature that it is a recursion on the 
inverse Jacobian Hn and is given by 

{PnSn - HnVn) Vn 



H, 



n+l 



(11.12) 



Note that our sign convention is different to the usual sign in the literature where H, 



n+l 



-B. 



-1 

n+l- 



Update Eq.(II.12 ) was not recommended by Broyden and subsequently became known as Broyden's 



"bad" method (BB) primarily, in our opinion, because the test problems to which the methods 



were applied favored Eq.(II.lO). Analogously to GB, the BB update finds the nearest matrix to 



Hn with respect to the Frobenius norm that solves Eq.(II.6). 



Generalizing the two methods suggested by Broyden one could consider updates of the form 



B., 



n+l 



Bn + 



Bn^n) 



(11.13) 



Barnes^ and Gay and SchnabeP^ proposed an update of this form where Vn is the projection of 
Sn onto the orthogonal complement of [sn~m, Sn-m+i, ■ ■ ■ , Sn] (0 < m < n). This strategy, known 
as projected updates, will be discussed in more detail in the next subsection. 
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A variation of Eq.(II.12) for the inverse recursion Hn was proposed by Martinez and ZambaldP^ 
that takes the form 

{Sn - HnVn) e^j 



n+l 



T 



(11.14) 



where ||yn||oo is 



where Cnj is the j 'th column of the identity matrix, chosen so that 
the component of yn with largest absolute value. The matrix update then differs from the previous 
update only in the j'th column. Numerical experience with this update is very 

favorabl(P™3. 

Recall that the GB and BB updates are the matrices nearest to the previous matrices with 
repsect to the Frobenius norm that satisfy the matrix secant equation. The main difference between 
the GB and BB methods is the space in which the "nearest" criterion is applied^. For GB the 
criterion is applied in the domain of the mapping, while BB is applied in the range, where the 
domain of the mapping is the space of the density differences s„ and the range is the space of the 



residual differences yn- We see from Eq.(II.8) that an ill-conditioned matrix update i?„ will lead 



to a large and numerically unstable estimation of the step Sn since, to compute this one needs to 



invert On the other hand, from Eq.(II.9) it is clear that a least change criterion in the space of 



the residual differences ?/„ will lead to smaller steps that could slow progress for well-conditioned 
problems. In other words, the BB algorithm is inherently more conservative than the GB algorithm. 
The advantage of this property is not apparent until one is faced with ill-conditioned or ill-posed 



problems - see Section III for a discussion in the context of the physics of a DFT problem. In this 
respect, the distinction between Broyden's first and second methods is analogous to the distinction 
between backward, or implicit, and forward techniques for numerical solutions to stiff differential 
equations. We formulate these statements more precisely below. 



The recursions Eq.(II.10 )-(II.13 ) are all one-step recursions in terms of the most recent Jacobian 



estimate. Most practical implementations make use of multi-step recursions on the inverse Jacobian 



that allow one to avoid explicitly forming the matrix. In particular, the recursion for Eq.(II.ll) 
with /3j = 1 (j = 0, 1, 2, . . . ), can be written aa^ (Theorem 6.2) 



Bn+l — — {Bq ^Yn — Sn) {Ln + S^Bq ^Yn) SnB^ ^ 



(11.15) 



where 



5„ := [so,si,S2,...,Sn], Yn := [yo,yi,y2, ■ ■ ■ ,yn\ (A:-by-(n 1) matrices) (11.16) 
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and 



-1 if i > j 



else 

For the multi-step recursion of Hn+i an elementary calculation yields the following recursion for 



Eq.(II.12): 



(11.17) 



3=0 j=0 \ i=j+l 

where the products ascend from left to right with the empty product defined as 1, and 

.,T 



and Zj := 



T 



We prefer this recursion over the recursion proposed by Srivastava'^ because, as we shall see in the 
following sections, we gain valuable insight into the geometry of the operations for the same storage 
requirements. Srivastava's formulation was implimented for LAPW code bjM^ although extensive 
numerical experience in the Wien2k code indicates that Hq needs to be adjusted dynamically, for 
reasons that will be clearer later. 



In the recursions Eq.(II.15) and Eq.(II.17) the initial matrix, Bq and Hq respectively, is crucial. 
Several authors have studied optimal initialization^23l25l26l2ZIM2Sl_ \Yg explore scalings in greater 



detail in Subsection |III C| A few papers have also explored strategies for combining the updates 
Eq.(II.15 ) and Eq.(II.17) in order to take advantage of the strengths of each. IP and TodcP^derive 
an update that is a convex combination of GB and BB with weighting determined explicitly so 
as to yield an optimally conditioned matrix. Their method is locally superlinearly convergenlP^l 
(Corollary 11) under standard assumptions and the additional conditions that the initial approx- 
imate Jacobian, Bq, is close to the true Jacobian at the solution and that the sequence of matrix 
updates and their inverses are uniformly bounded. MartineJ^ switches between GB and BB based 
on a simple criterion that estimates which method will give the best performance at a given step. 
Numerical experiments showed this to be a promising approach, but rates of convergence were not 
pursued. 



D. Multisecant Methods 



In the context of path following algorithms, it is natural to look only at local information in 
order to construct a step direction and length. To generate the n-l-l-th Jacobian approximation the 
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methods described above satisfy the matrix secant equation Eq.(II.5) or Eq.(II.6) for the current 
step Sn and residual difference y„. However, for highly nonlinear problems, where iterates may 
belong to domains of attraction of different solutions, it is perhaps more appropriate to view the 
previous data as samples of an unknown process in a high dimensional space. In this context, 
updating the Jacobian based only on the most recent sample and ignoring the other sample points 
imposes a bias. Invoking some of the terminology used for stochastic optimization, searching for 
the nearest matrix that satisfies the matrix secant equation only for the most recent sample point 
is a greedy strategy without recourse. 

Multisecant techniques put the previous data on more equal footing with the most recent steps; 
that is, rather than satisfying the matrix secant equation for only the most recent step one satisfies 
all matrix secant equations simultaneously: 

Yn = BnSn, (11.18) 

where Sn = [si,n, S2,n, • • • , Sni,n] and Yn = [yi,n, y2,n, ym,n] are k-hy-m {m < min{n, k}) ma- 
trices whose columns are previous steps and residual differences respectively. Multisecant tech- 
niques have been studied by several authors in the mathematical literature over the last 48 
year j^^ ' ^^ ' ^^ ' ^'^ ' ^^ ' ^^ ' ^^ ' ^^ J^^. A few authors in the physical sciencePSEHIol independently derived 
updates that, we will show, are simple relaxations of more conventional multisecant methods. 

For our application the dimension of the problem, k, is much larger than the number of iter- 
ations, n, so we will simply consider m matrix secant equations with m < n. We do not as yet 
specify the composition of Sn and Yn since there are many options. The construction given in 



Eq.(II.7) is conventional for matrix secant methods; an alternative construction centers the steps 



on the initial point po rather than the previous step as in Eq.(II.7): 

Sj,n = Pj-po and yj^n = {F{pj) - pj) - {F{po) - po), (j = 0, 1, . . . , n - 1). (11.19) 
The method we propose in the following section is centered on the current point 

Sj,n = Pj - Pn and yj^n = {F{Pj) - Pj) - {F{Pn) - Pn) (j = 0, 1, . . . , n - 1). (11.20) 

Many multisecant methods are easily understood by formulating the underlying optimization 
problem each of the approximate Jacobians (implicitly) solves. We consider first the optimization 
problem 

minimize -\\A-Xf + Lc{X) (11.21) 
-—'"xfe 2 
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where, throughout, the norm of a matrix is the Frobenius norm, j4 is a real k x k matrix and D, G 
are real k x m matrices such that the set C defined below is nonempty: 

C := |x G M^^'^ such that XL> = g} . (11.22) 

The corresponding indicator function of C, lc, is defined by 



if X G C 
oo else. 



The solution to the optimization problem Eq.(II.21) expressed as 



aigminx {^\\A-Xf + LciX)^ (11.23) 
is the prox mappin^^^ of the indicator function to C at A: 

Proxi/„^,^(A) := argmin^ ||p - Xf + tc(^)} • 
It is an elementary exercise (see, for instance,'^ (Chapter 1, Section G)) to show that 

proxi/^,,^(A) = Pc{A) (for ah a > 0) 

where 

Pc(^)=argmin^gc{P-^ll} 

is the projection of A onto C. This projection has a simple closed form so long as the columns of 
D are linearly independent: 

Pc{A) = A + {G- AD) {d'^D)'^ D'^. (11.24) 

Specializing to multisecants, if j4 = Bq, the n'th approximation to the Jacobian, D = Sn ^ j^fcxm 
and G = Yn G i^^x™ [\ < < y^), the columns of which are denoted yj and Sj respectively 
(j G [0,n]), then we arrive at the multisecant extension of the Broyden's first update (MSGB) as 
studied bjffZnSEMS 

Bn+l = Pc{Bo) =Bo + {Yn - BoSn) {S^ Sn)'' . (11.25) 
Elementary calculations using the Sherman-Morrison- Woodbury formula yield the multi-step re- 



cursion for BI^ , analogous to Eq.(II.15), 



B-l, = B^' + {Sn - B^^Yn) {{SlSnr'SlB^'YX {SlS^r'S^B^' (11.26) 
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For the update Eq.( |II.13 ) Vn is the projection of s„ onto the orthogonal complement of 
[sn-m, Sn-m+i, ■ ■ ■ , Sn] {0 < m < Ti) where Sj and yj are given by Eq.(II.7). Here the update 
Bn+i will satisfy m + 1 consecutive secant equations Bn+iSj = yj {j = n — m, . . . ,n). Methods 
based on this approach are called projected secant updates. This idea, however, seems not to 
have benefited from the endorsement of prominent researchers at the time, and hence there is 
little numerical experience to recommend it. A notable exception is the work of Martinez and 
collaboratord22E3E2ESl_ Sequences based on update Eq.(II.25) are shown in^^ (Theorem 2.5) to be 



locally q-superlinearly convergent if, in addition to other standard assumptions, the approximate 
Jacobians, i?„ stay close to the behavior of the true Jacobian, and if the columns of Sn are strongly 
linearly independent as measured by the condition number 



K{Sn 



Sry 



An alternative specialization of Eq.(II.24) leads to a multisecant form of Broyden's second 



method (MSBB) if welet A = Hq, D = Yn and G = Sn so that 

Hn+l = Pc{Ho) =Ho + {Sn - HoYn) (if ^n)"' . 



(11.27) 



To our knowledge, there are no published numerical comparisons of Eq.(II.27) to alternatives, nei- 



ther is there any published convergence theory, though we believe this is only a minor modification 



of the theory for Eq.(II.25). 



Yet another specialization of Eq. (II. 24) is a simplex gradient techniqu^^ for vector-valued func- 
tions proposed by Gragg and Stewart^^ and implemented iiP^l. Here the Jacobian update is given 
by 



Bn+l = -Pc(O) = Yn{Sn Sn) ^ S^ . 



(H.28) 



where A = 0, D = Yn and G = Sn where the steps Sn and residual differences yn are centered on 



the initial point po ^ in Eq.(II.19) 



Independent studies appearing in the physics literature that parallel the mathematical literature 
have a different variational form. The various approaches can all be shown to be specializations of 
the the following optimization problem 



minimize 

xeRfexfe 2 



n 

^^a,distl^{X) + ^\\A-X\ 



(11.29) 



where each Gj is a set of the form Eq.(II.22) and A € 



pkxk 



and dist c (^) is the Euclidean 
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distance of X to the set Cj. A short calculation yields the solution to Eq.(II.29) 



X^ 



"0 



Era 
j=0 «J 



En 



a. 



(11.30) 



From Eq.(II.24) this can be written explicitly as 



(11.31) 



where 



a 4 



(11.32) 



Specializing to multisecants, let A = L'j = Sj and G = yj, where Sj and yj are defined 



by Eq.(II.7). Then the optimization problem Eq.(II.29) corresponds to the variational formulation 



of a method proposed by Vanderbilt and Louie^. A local convergence analysis, together with 
numerical tests are studied in''. Our derivation and formulation of the update, however, appears 



to be new and clarifies the connections between their method and Eq.(II.25) above: 



j=0 j=l 



1„T 



(11.33) 



If instead we let let A = Hn, Dj 



yj and Gj = sj, we get the update proposed by JohnsoiM 

(11.34) 



Hn+i = E ^i^" + E ((^J ~ ^"2/i) {yjvj) y 

j=0 j=l 



1 T 
^'3 



Again, our derivation is different, and the new formulation makes the connection with Eq.(II.27) 
more transparent. 

The weighting scheme oi^*^ is similar to a technique proposed by Pulaj'^. A dynamic weighting 
scheme that optimizes the weights 7j simultaneously with the determination of the matrix Hn or 
Bn is possible via the extended least squares techniques outlined in^^. From the analysis above, the 
"active ingredients" in the methods oPlancP involve the nonuniform weighting of the conventional 
multisecant formulations of Broyden's methods, and the centering of the prox mapping on the most 



recent matrix approximation Bn and Hn rather than on Bq and Hq as is the convention in Eq. (11.25 ) 
and Eq.(II.27). A variation of Eq.(II.34) due to Kawata et aP combines the method of Johnson 



with a construction of the columns of Sn and Yn proposed by Pulay and given in Eq.(II.20). We 



note that the methods summarized by Eq.(II.33)-(II.34) and their relatives solve single matrix 



secant equations in parallel while the methods summarized by Eq.(II.25 )-(II.27) solve multiple 
matrix secant equations simultaneously, which is more restrictive. 
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In the above analysis we are not specific about how many previous steps should be included in 
the matrices Sn and 1^. Recall that these matrices are made up of m columns of previous step 
information where m € [1, f^] • If m < n then one is implicitly executing a limited memory technique. 



The name limited memory stems from the sequential ordering of steps according to Eq.(II.7) and 
refers to the practice of excluding points more than m steps previous in the construction of the 
current matrix update'^^. If one constructs Sn and Yn via Eq.(II.20), as we do in the following 



numerical experiments, then one would exclude points that are most distant from the current point 
Pn- This is a reasonable strategy for highly nonlinear problems, where the linear approximation 
that is at the heart of quasi-Newton methods is only valid on a local neighborhood of the current 
point. For extremely large problems such a strategy is also expedient since the matrix updates 
need not be explicitly stored as they can be constructed from a few stored vectors. 

Finally, we note some other multisecant approaches that don't fit into the framework above. 
jjj35] ^ multisecant BFGS update is studied, however there do not appear to be computationally 
efficient and stable methods for generating updates that preserve the algebraic structure of the 
original BFGS update, namely symmetry and positivity. The SRI update, also studied iiP^, is 
shown to be computationally efficient and stable though at the cost of symmetry. In contrast 
to these, Broyden's updates are natural candidates for multisecant approaches since they do not 
enforce symmetry of the Jacobian (they are designed for solving systems of nonlinear equations 
rather than for solving optimization problems as with BFGS and SRI). For problems where the 
Jacobian is ill-conditioned or singular at the solution, Frank and SchnabeP^'^ consider methods 



based on the third terms of the Taylor series expansion in Eq.(II.2), which they called "tensor 
methods", that build upon the multisecant model. They reported on average improvements of 
33% for highly controlled test problems in which the null space of the Jacobian has dimension 1 
or 2. There appears to be little more numerical experience with tensor methods for matrix secant 
approaches. While this may be an avenue for further research, we do not pursue this idea in the 
present work. 

III. NEW DEVELOPMENTS: SAFEGUARDED MULTI-SECANTS 

Newton-like algorithms, such as Broyden's methods, are not global techniques for solving equa- 
tions and can behave wildly, even chaotically, far from a solution. As we have seen in the previous 
section, solving the self-consistent field equations is equivalent to finding the fixed points of the 



SCF mapping F given by Eq.(1.2). The extent to which many algorithms behave, or misbehave. 
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depends on the functional properties of the SCF mapping. Consider, for instance, an even simpler 
algorithm to Broyden's method, namely 

p„+i=F(p„); (III.l) 

that is, we replace the current density with the density under the SCF mapping. This is a simple 
form of a fixed point iteration. If F is a contraction on some closed subset of the space of densities 
(i.e. points move closer to one another under the mapping F), then the sequence pn converges to the 
unique fixed point of F (this is known as the Banach Contraction Theorem see, for instancd^SI). 



If F is not a contraction, then Eq. (III.l) could continue forever without ever approaching a fixed 
point. Successive iterates might form a characteristic path, or they might behave chaotically. One 
could reasonably conjecture that many Kohn-Sham SCF mappings are not everywhere contractive 



since Eq. (III.l) is not a popular numerical method. Less restrictive than contractive mappings 
are nonexpansive mappings (i.e. points do not move further apart under the mapping F). Of 
course, all contractions are nonexpansive, but the converse does not hold. If is a nonexpansive 
mapping on a closed, bounded and convex subset of the space of densities, then the modified 
mapping F(p) := p + \{F{p) — p) for small A > is contractive, and F(p), and hence F, has a 



fixed poinlPSEZI. The iteration to find the fixed point of F corresponding to Eq.( III.l ) in terms of 
the original mapping F is 

Pn+l = F{pn) := Pn + K^ipn) - Pn)- (111.2) 



Most readers will recognize this as the Pratt stepP^. Convergence is guaranteed for nonexpansive 
SCF mappings on compact convex regions (though convergence can be extremely slow), but if A is 
too large, or F is only locally nonexpansive, or worse, not even nonexpansive, then the theory for 



fixed point iterations like Eq.(III.2) and even Broyden's methods does not have much to say and 
numerical behavior cannot be guaranteed. In other words, anything can happen. 

In particular, since it is possible for the SCF mapping to have several fixed points there is no 
guarantee that an algorithm will converge to the correct fixed point, as is the case if the fixed point 
is nonphysical, known as a "ghost band". Indeed, the iteration need not converge at all. Even if 
one supposes that the SCF mapping is contractive on closed neighborhoods of each of these fixed 
points, if an algorithm takes too large a step it could leave the domain of attraction of one fixed 
point and drift towards another fixed point. This process of drifting between attractors of the 
SCF mapping could continue indefinitely, and with a chaotic trajectory, if the algorithm is not 
sufficiently well controlled. 
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Note that one can extend the above concepts to a subset of the density variables. For instance, 
the sp-electron states might behave well, while d-electron states might be very difficult to converge. 
This suggests a certain separability of the variables associated with these quantities, and that the 
SCF mapping is more sensitive to some subsets of the density variables than others. Indeed, a 



frequent observation exhibited in Section IV is that the density within the muffin-tins often behaves 
very differently to the density in the interstitial region. 

The problematic part of the Kohn-Sham mapping is the effective potential Vp. In general, 
there is no closed form for Vp. For certain approximations, denoted V, it is possible to prove the 
correspondence between the fixed points of the corresponding SCF mapping Fy and solutions to the 
Kohn-Sham equations'^, and, moreover, that Fy is a contractioiJ^. However, for exact Vp at finite 
temperatures existence and uniqueness of fixed points is an open question, further complicated by 
the occurrence of systems with multiple coexisting phased. For the practitioner who simply wants 
her software to converge for a particular example, unfortunately this means that the algorithms 
come only with extremely limited warranties that may not even be verifiable. 

With this caveat in mind, and before we present the details of our numerical method, we take 
a moment to describe in physical terms some of the features of ab-initio calculations that are 
problematic, together with common symptoms of poorly convergent problems. 

i. In many cases, for instance bulk MgO, the algorithms reach an acceptable solution in a 
surprisingly small number of iterations, e.g. 10 — 20 for 10^ unknown density components. 
This implies that, at least for a substantial subset of the density parameters, the domain of 
attraction of the fixed point is large and the SCF mapping has "good" functional properties 
on this domain. 

ii. In some cases there can be issues with the scaling of different parts of the density because 
they are represented in quite different fashions. For instance, with LAPW methods the plane 
wave components outside the muffin-tins are represented by the Fourier coefficients whereas 
the density inside the muffin-tins is expanded in terms of spherical harmonics. 

iii. The conventional wisdom for LAPW-based methods is that the muffin-tins should be as large 
as possible without overlapping. This implies that the basis set used for the muffin-tins is 
somehow better suited for the physics or for the geometry of the atoms. This is manifested in 
more rapid convergence of the coefficients corresponding to these basis elements and indicates 
that the domain of attraction of the fixed point for these coefficients is large relative to the 
domain of attraction for the fixed point of the plane wave elements. 
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iv. The most physically interesting problems are often harder to solve. A spin unpolarized DFT 
calculation of NiO, for example, may converge very slowly. The slow convergence of the 
mixing cycle is in part because spin unpolarized the system is metallic, but is also coinciden- 
tal with an imperfect functional description of this system, in which case the Hamiltonian 



in Eq.(1.2) can be ill-posed. It is not uncommon to compromise on the physical model. 



particularly for large and complicated problems. 

V. In some cases, for instance when there are d or / electrons, charge carriers are in a large 
unit cell and for surfaces, mixing converges poorly and can easily diverge. In the literature 
this is called "charge sloshing" because one has oscillations of charge density between dif- 
ferent spatial regions of the problem or between different local states such as d-electrons. 
Mathematically this sometimes corresponds to ill-conditioning when a small change in the 
density p can lead to large change in F{p), with large eigenvalues of the matrix H (or small 



eigenvalues of B). Alternatively, it may be that the higher-order terms of Eq.(II.2 ) are large, 
so neglecting them is only appropriate for a very small change in the density. A third possi- 
bility in the case of charge sloshing is that the SCF mapping is not nonexpansive (and hence 
not contractive) along this trajectory. 

vi. It is possible with a LAPW method for the calculations to enter regions where the new 
density has become unphysical. Such numerical densities are called "ghost bands" and are 



unphysical solutions of the Kohn-Sham equations Eq.(1.2). 



To illustrate these features a simple model is an O2 molecule starting from atomic densities 
where the two atoms are deliberately treated differently, one starting with a spin of +2 the other 
with 0. Shown in Figure [T] is the variation of the spin (a) and total charge within the muffin tins 
(QMT) (b) varies during a representative SCF iteration. While the spins converge smoothly to the 
final solution, the total charge oscillates or "sloshes". Figure [T] shows the behavior of the spin (a) 
and the total charge, QMT (b), within the muffin tins for sequences of densities approaching the 
solution. For this sequence of points the spin of the density shown in Figure [Tj^a), converges 
smoothly to the solution. A view of the charges of the electrons within the muffin tins. Figure 
[Tj^b), tells a different story. The total charge within the muffin tins appears to zig-zag toward the 
solution. 

The behavior of different physical quantities for sequences of points approaching a fixed point is 
independent of algorithm design (as opposed to algorithm performance) and is symptomatic of the 
functional properties of the SCF mapping, i.e. the underlying physical problem. Our purpose is to 
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FIG. 1: Iterates for an O2 molecule with atomic densities having a spin of +2 the other with 0. Figure (a) 
shows the spin within the muffin tins and (b) shows the total charge within the muffin tins (QMT) during 
a representative SCF iteration. 



design an algorithm that performs well for both regular SCF mappings as well as more pathological 
cases. As each model presents unique analytical irregularities, we seek to address the most common 
challenges in our numerical algorithms. 

A. Toward a robust algorithm 

In addition to numerical performance for a wide range of DFT calculations, a good numerical 
method, in our opinion, will require little or no user intervention. A key observation is that the 
dimension of the underlying problem is on the order of 10^ or higher while the information that one 
uses to model the fixed point mapping is at most dimension 2n where n is the number of iterations 
(on the order of lO'^). As mentioned above, the conventional view is that the n steps and residual 
differences generated in matrix secant methods are deterministic points on a path to the solution. 
While this is true for problems of small and moderate dimension, for high-dimensional problems 
such as SCF calculations, we can alternatively consider the n steps as random samples of a high- 
dimensional mapping, albeit with decreasing randomness as the matrix secant model is refined in 



subsequent iterations. We therefore consider the vectors Sj and yj given by Eq.(II.7) as merely 
data samples with less significance given to the order in which the samples were collected than with 
conventional matrix secant approaches. In particular, we center our model on the current iterate 
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and calculate all steps and residual differences computed relative to the current point according to 



Eq.(IL20). The algorithm then predicts the behavior of the SCF mapping Eq.(1.2) at p„ given an 



appropriate model of the previous data. 

The multi-secant methods detailed in the previous sections can all be rewritten as 



Pn+l 



9v 



(III.3) 



where gn = F{pn) — Pn, is a matrix dependent on the method, and Hq is an initial matrix 
secant estimate. Let us write this as 



Un+Pn 



(III.4) 



where, according to Eq.(III.3), pn = —Sn-iAngn and Un = —Hq{I — Yn-iAn)gn- We show in 



Subsection III B that pn is the part of the vector gn that can be explained by (is in the range of) 



the data at step n, and n„, is the part that is orthogonal to this information, and hence unpredicted. 
The numerical challenges described in ([iv|)-([iH) are embedded in the unpredicted component of the 
new step. Written this way, we can distinguish two numerical challenges. First and most obvious, 
one must safeguard against the unpredicted behavior Un- More subtle is the a posteriori accounting 
of the vector gn and an underlying distance to ill-posedness of the prior information, which reflects 
the quality of the information. In other words, there is no free lunch: not only do we have to be 
cautious of unpredicted behavior, but we also have to be careful of how we treat our predicted 
behavior. It is reasonable to expect that with more information the size of the unknown orthogonal 
component Un will decrease, however, this is not always the case if the model does not adequately 
capture the true SCF mapping, or does not adequately adjust for the presence of multiple scales 
within the model. In addition, a poorly conditioned, or nearly ill-posed model might have an 
unstable direction in which numerical errors overwhelm meaningful information on the same scale. 
Our numerical strategy must safeguard against both large unpredicted components and unstable 
models. We show next how this perspective translates into algorithms. 

B. Scaling, regularization, and preconditioning 



The discussion in the previous subsection is easily made rigorous when we consider the multise- 



cant formulation of Broyden's second method Eq.(II.27) MSBB with the steps centered according to 



either Eq.(II.7) or Eq.(II.20). To see this, note that from Eq.(II.9) with Hn replaced by Eq.(II.27|, 



the modification of Eq.(III.3) to MSBB amounts to letting 

An ■= [Yn-iYn-l] Y^i^. 



(III.5) 
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Note that (Y^_^Yn-i) ^ Y^_^gn is the solution to the least squares minimization problem 

minimize i ||y„_iz — (7„|P, (III. 6) 

where m £ [l,n — 1] is the number of previous data points used in the update. In other words, 
idn is the element in the domain of Yn-i that comes closest (in the least squares 
sense) to "predicting" the vector gn- It follows, then, that 



J - Yn-lAn)9n = [l - ^n-l ^Z-l jSn (IH-T) 

is the orthogonal projection of onto the space orthogonal to the residual differences yj defined 



by one of Eq.(II.7) or Eq.(II.20), our prior data. 

The above discussion assumes that Yn-i is full-rank. If the columns of Yn~i are nearly linearly 
dependent, then the inverse (Y,^__^Yn~i) ^ can be numerically unstable. More fundamentally, we 



are implicitly assuming that the approximation to the Jacobian in Eq.(II.3) is, first of all, valid on 
the neighborhood of pn defined by the other data points and, second of all, that the Newton step is 
the right step to take. If either one of these assumptions does not hold, as would be the case when we 
are far from the solution and our sample points are far apart, conventional optimization strategies 
link local and global techniques by allowing steps to rotate between the steepest descent direction 
(in the present setting, the direction of the vector gn) and a Newton-like direction. One well-known 
strategy of this kind is the Levenberg-Marquardt algorithnJ^SEII. "We propose a different technique 
that is an unusual use of a classical regularization technique usually attributed to Tikhono'v^^ZElISIl 
and rediscovered in the statistics community under the name of ridge regression^^, though the 
more general notion of proximal mappings due to Moreau^^ predates both of these. In particular 



we regularize Eq.(III.6) in the usual way: 

minimize i||yn-i^; - flnlP + flUII^, (a > 0). (III.8) 

Recalling the prox mapping introduced in the previous section, we could center the regularization 
more generally on any point zo, however, since we have no prior information about z the logical 



choice for the regularization is to center it on the origin. The solution to Eq.(III.8) is 



Zn = {Y^-iYn-l + al) ' yj_i9n (111.9) 
which yields the following regularization of A^. 

:= {Y^_,Yn-i + aiy' Y^_,. (III.IO) 
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Note that as a — > oo, — > 0, and the step generated by Eq.(III.3) rotates to the direction H^gn- 



If Hq is a scahng of the identity (see Subsection III C ) then in the context of optimization, where 
Qn is actuahy a gradient, the stronger the regularization, the more the step rotates in the direction 
of steepest descent and away from a Newton-hke direction. We thus interpret the regularization 
parameter in both the conventional way, stabilizing {Y^_]Yn~i) ^ ■, and as an estimation of the 
uncertainty of the approximate Newton step. Given our understanding of the previous step data 
as pseudo-random samples from an unknown process, the latter interpretation has a very natural 
explanation in terms of the Wiener filter for a signal with normally distributed zero-mean white 
noise. The size of the regularization parameter corresponds to the energy of the noise, or uncertainty 
in our model. 

JohnsoiP proposes a scaling of the columns of the matrices of Yn and Sn for numerical reasons, 
though this can easily be shown to have no formal impact on the algorithm. In the context of reg- 
ularization, however, such a scaling can have a significant effect on the choice of the regularization 
parameter. This scaling is equivalent to multiplication of the matrices Yn and Sn on the right by 



the diagonal matrix The rescaled least squares problem analogous to Eq.(III.8) is 

minimize i||y„_i^'n-2 — Onll^ + (o > 0) 

with the solution 

{^nY^_^Yn-l^n + nY^_^. 

It follows immediately from this that if we normalize the columns of Yn^i, 



(III.ll) 



(n-l)| 







v 







(III.12) 



i/Wy^^r'^W J 



where yj" is the j-th column of then our regularization parameter will be independent 

of scaling between the columns of the matrix Yn-i- This allows for more robust regularization 
strategies. Viewing the regularization as a Wiener filter applied to the approximate Newton step, 
the scaling reduces the effect of outliers on the regularization parameter in the least squares es- 
timation, these outliers coming from steps that are relatively far away from the solution we seek. 
We denote the matrix corresponding to this scaling, together with the regularization a by An' 
where 



^n{'^nY^_^Yn-l^n + aI) ^ ^nY^_^. 



(III.13) 
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The step is then generated by Eq.(III.3) with replaced by An 



FormaUzing our approach for Broyden's first method MSGB is not as obvious. From Eq.(II.8) 



with B^^ replaced by Eq.(II.26), the modification of Eq.(IIL3) for MSGB amounts to letting 
Ho 



Bq ^ and 



An 



-1 



(III.14) 



Again, we note that (^S'^_iSn-i) ^ S^_iBq^w is the solution to the least squares problem 

minimize ill/S^-i-z — B'r.^wW'^. 

2g]Rn-l ^ 

If, in addition, Bq = l/crl^ then an elementary calculation yields the simplification to Eq. ( 111.14) ) 

An = (5^_iy„_i)"' 5^_i (III.15) 

If {S'^_'^Yn-i) is well-defined, then the mapping / — Yn—i-^n is a nonoTthogoncd projcctioil^^^ onto 
the nuUspace of the columns of Sn-i, or in other words, a projection onto the space orthogonal to 



the range of the columns of Sn-i, our prior step data. Unlike Eq.(III.5) the projection is not to 
a nearest element in the range of S^_i, hence, by definition, the resulting step will be larger than 
the orthogonal projection. 

The above assumes that (S'J_]^l^_i) ^ is well-defined, which is likely, but this says nothing of 
whether or not S'J_]^yn-i is well- conditioned. Indeed, S^_iYn-i need not have real eigenvalues, 
or be positive definite. Regularization of ^ in Eq.(III.14) gives (5^_^y„_i + al) ^ 

which shifts the eigenvalues to the right. Since S'^_iYn-i could have negative eigenvalues, unless 
a is chosen large enough, this regularization could result in an even more ill-conditioned matrix. 
Our numerical experience is that a > 10^® is sufficiently large to avoid this possibility for the 
applications of interest to us. 

We turn next to preconditioning and scaling. We propose rescaling the density p„ to account 
for multiple scales between the interstitial electrons and the muffin tin electrons. There are many 
possible strategies. Such scalings are generically represented by multiplying the density /)„ at each 
iteration n on the left by an arbitrary invertible diagonal matrix The same scaling must also 
be applied to the result of the SCF mapping acting upon p„. If these scalings are applied before the 
mixing operation and undone after the mixer yields its proposed step, then one need not change 



any of the formalism above; specifically, one replaces Yn, Sn, and A^ in Eq.(III.3) with Yn := ^nYn, 



^nSm and. 



^n^'fnSi^^Yn-l^n + alj ^n^^.^l^n (MSGB), Or 
(^n?J_ll^„-l^n + "fnY^-l^n (MSBB) 



(III.16) 
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One of the benefits of such scahngs is to increase the numerical accuracy of matrix multiplication 
when the matrices consist of elements of vastly different scales. 



The preconditioner used in the numerical experiments in Section IV rescales the change in the 
interstitial electrons relative to that in the muffin-tin electrons. Recall that the residual of the 
SCF mapping for the density pn is gn = F{pn) — Pn- We represent the interstitial and muffin-tin 
portions of this residual by gn"^ and respectively where 



9n 



yn 

AM) 
yn 



Denote the averages of the residuals of these components separately by 



j=0 3=0 

Using the formalism above, our preconditioner 0„ is defined by 




ri„ = I I where LOr, 



gn 



(III.18) 



and Ij is the Ij x Ij identity matrix where Ij is the dimension of the interstitial/muffin tin electrons 
respectively. We note that the ujn term enters the multisecant form squared, hence our use of a 
square root. Removing this square root is also reasonable, and in some cases is better in numerical 
tests, but it can be less stable and lead to runaway behavior where the interstitial regions converge 
too rapidly, upsetting the balance between these and the muffin tin electrons. More sophisticated 
preconditioning are also plausible, for instance a dielectric term for the plane wave^SfiEZ!^ though 
we found this simple form to be very effective. 

Before concluding this section we address a physical point. A relevant question is whether 
the step constructed by these matrix secant methods conserves the total charge - if not, then an 
additional constraint is needed. By construction all the Sn, Vn values conserve charge, as does gn, 



and the preconditioners and scalings simply change the matrix given by Eq.(III.16). The result 
will then conserve charge automatically within numerical accuracy, so no explicit charge constraint 
is necessary. 



C. Step Control 

In Broyden's original numerical experiments he constructed Bq from a finite difference approxi- 
mation to the true Jacobian (se^^ (Section 7)). This is not a practical approach for extremely large 
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problems such as DFT calculations. For large problems, the convention for the initial estimate Bq 
or Hq is a scaling of the identity; that is, at each iteration n we choose -ffo.n = CTnl and likewise 
Bo,n = l/<7n- The choice of the scaling is critical ~ if it is poorly chosen iterations can stagnate or 
diverge. Note that the generating matrix can vary at each iteration. With this generating matrix 



the unpredicted component of the step s„, given by Eq.(III.3 )-(III.4 ) is Un = —(Tn{I — Yn-iAn)gn- 
The scaling (T„ has no impact on the predicted component pn = —Sn-iAngn- A technical discussion 
of strategies for choosing (T„ are intimately connected to a convergence analysis of the algorithms, 
which is the topic of subsequent work. 

For our purposes it suffices to give a number of effective controls, with reasonable heuristics. 
To motivate our strategy we consider the numerical experiment. Figure |2] showing the root mean 
squared change of the charge within the muffin tins plotted against various values of a on different 
neighborhoods of the solution. Far from the solution, the variation is non-linear (Figure [2][^a) ) over 
the range 0.05 < a < 0.30. Since the step size is directly proportional to the size of a Figure 
^a) indicates that the linear model is correct only on a small neighborhood of the current iterate. 
If a is too large, the algorithm chooses a step outside the range of validity of the local model; 
for an LAPW code this will lead to ghost bands and can lead to divergence. As the iterations 
proceed, that is, on closer neighborhoods to the solution, the change in the charge as a function 
of a appears to be linear (Figure [2](^b) ) over the same range, indicating that the linear model of 
the SCF mapping is accurate on a larger neighborhood of the iterate. Thus we interpret a in the 
MSBB update as a type of trust-region parameter for a linear model of the SCF mapping. 

Our strategy for implementing a dynamic step length an has three parts. First among these is 
to constrain o"„ so that the step in the direction of the unpredicted component has an upper bound 
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that is proportional to the size of the predicted component: 

(Jn < R\Pn\/\gn\ (III.19) 

where i? is a fixed parameter. In the context of the MSBB update, this ensures that the unpredicted 
(and unpredictable) component of the step at each iterations does not dominate the total step. 
One has to take some step along this component, as otherwise no new information is generated; 
however if too greedy a step is taken the algorithm can diverge. For an "easy" problem R can be 
larger than for a "difficult" problem since for easy problems the unpredicted component of the step 
is naturally well-scaled. Our experience indicates that for robust performance across a wide variety 
of problems, the parameter R is the most important control element for For hard problems a 
value of R from 0.05 to 0.15 works well. As a second level of control, we bound the total variation 
between successive scalings: 

^„ =(7„_i*max(0.5,min(2.0,||(7„_i||/||5„||)). (III.20) 

Note that, by this control, cr„ cannot be less than half nor more than twice the previous scaling. 
Note also that we do not reject steps that yield a larger residual gn, but rather reduce the size of 
the step in the unpredicted direction. In almost all cases a large improvement is achieved in the 
next step by retaining the bad step, though this strategy runs the risk of developing ghost bands. 
As a third level of control, we include an upper bound on the absolute value of the scaling, a. For 
our applications we found a ~ 0.1 — 0.2 to be effective. Finally, for the very first cycle we take a 
small step with 

do = a * (0.1 + exp(-2.0 * max(dQ, dPW/3.5, dRMT))), (III.21) 

where dQ is the change in the charge within the muffin tins, dPW is the change in the rescaled 
plane waves and dRMT is the change of the density within the muffin tins. This form is based 
upon numerical experience with Wien2k, and is somewhat conservative. 

D. Summary 

Algorithm III.l (Regularized, preconditioned, limited- memory multisecant method) 



0. Choose an initial po, ao according to Eq.(III.21 ), generate pi = po + ^{F{po) — Po) for A > 
some appropriately chosen step length (this is the Pratt step Eq.(III.2)), set n = 1 and fix 
a > (10-6 to 10-^). 
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1. If the convergence criterion is met, terminate. Otherwise, given Sn-i and Yn-i, whose 
columns are steps sj and residual differences yj respectively (j = n — m, ,n— (m — 1), . . . ,n— 1 
for some appropriate number of prior steps, e.g. m = min{n, 8}) centered on the current 



point pn as in Eq.(II.20), calculate X'"^"'^" via Eq.(III.16) for either MSBB or MSGB with 



the scaling given by Eq.(III.12) and the preconditioner given by Eq.(III.17)-(III.18 1. 



Determine the value of cj„, according to 



an = mm{an, R\Pn\/\gn\, cr} 



(III.22) 



where is given by Eq.(III.20) and a is some appropriately chosen upper bound (0.1 to 



0.2). Calculate the next step Pn+i according to Eq.(III.3) with An replaced by 



2. Evaluate F{pn+i), set n = n + 1 and repeat Step 1. 



IV. RESULTS 



We test the performance of the algorithm on five examples of increasing physical difficulty, all 
run using the Wien2k cod^ and the PEE functionaPS we provide the details below with technical 
information so they can be reproduced as well as reasons for their choice. 

Model 1 Simple bulk MgO, spin-unpolarized with RMT's of 1.8 a.u., an RKMAX of 7 and a 
5x5x5 /c-point mesh and a Mermin-functionaP (i.e. Fermi-Dirac distribution) with a 
temperature of 0.0068eV. This is a very easy to solve problem. 

Model 2 Bulk Pd, spin-unpolarized with RMT's of 2.0 a.u., an RKMAX of 7.5, a 5 x 5 x 5 fc- 
point mesh and a Mermin-functional with a temperature of 0.0068eV. This is slightly harder 
because of the possibility of sloshing between the d-electron states and the fact that one 
should use a larger sampling of reciprocal space. 

Model 3 A bulk sihcon cell with an RMT of 2.16 a.u., an RKMAX of 7.0, a 6 x 6 x 6 A;-point 
mesh and a Mermin-functional with a temperature of O.OOlSeV. 

Model 4 A2x2x2Pd superceh with a vacancy at the origin, RMT's of 2.5 a.u., an RKMAX of 
6.5, a fc-point mesh of 3 x 3 x 3 and a Mermin-functional with a temperature of 0.0068ey. 
Here, in addition to sloshing between d-electron states one can have longer-range dielectric 
sloshing. In addition, this is a poorly constructed problem because the RKMAX is too small 
as is the /c-point mesh. 
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Model 5 A 4.757 x 4.757 x 34.957 a.u., spin-polarized (111) fee niekel surfaee with seven atoms 
in the range — 1/3 < z < 1/3. Teehnieal parameters were RMTs of 2.13, an RKMAX of 7 
and a 11 X 11 X 1 A;-point mesh, also with a Mermin-funetion temperature of 0.0068ey. It 
should be noted that the two surfaees are suffieiently elose together, so there is real eleetron 
density in the vaeuum. In this ease one ean have spin sloshing, d-eleetron sloshing as well as 
long-range Coulomb sloshing of eleetrons in the vaeuum. 

In all eases we started from densities ealeulated as a sum of independent atoms, and the ealeula- 



tions were run with both forms of Broyden multiseeants given by Eq.(II.27) and Eq.(II.25), as well 



as the more eonventional Broyden first Eq.(II.lO) and seeond Eq.(II.12) methods. Convergenee 
eriteria were an energy ehange of 10~^ Rydbergs and an RMS eonvergenee of the eharge within 
the muffin tins of 10~^ eleetrons. For the multiseeant implementations eight prior memory steps 
were used. To simplify the results, unless noted otherwise we used fixed values of the regulariza- 
tion parameter a of 10~^ and K = 0.1. In almost all eases. Figure [s] shows that the eonvergenee 
appears to be linear, although the preeision of the ealeulations does not allow one to observe the 
final asymptotie behavior, ineluding rates of eonvergenee, of the algorithms. 



TABLE I: Iterations to convergence as a function of a for models 1 — 5 with fixed a ~ 10^"* and R — 0.1. 
The mean and standard deviation are for a between 0.05 and 0.8 for Models 1 — 3, 0.05 to 0.5 for Models 4 
and 5. 
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BE 


GB 




mean 


stdev 


mean 


stdev 


mean 


stdev 


mean 


stdev 


Model 1 


16.22 


0.44 


14.56 


1.01 


18.67 


1.66 


22.44 


12.71 


Model 2 


12 





12.89 


0.33 


20.11 


0.78 


39.78 


9.58 


Model 3 


15.44 


2.35 


16.78 


0.67 


25.22 


2.95 


57.56 


7.76 


Model 4 


24.17 


2.04 


29.17 


4.92 










Model 5 


54.60 


3.51 















For the very simple Model 1 all the methods eonverge quiekly and the parameter a has no 
signifieant impaet on performanee. The MSGB method is slightly faster, but as the latter results 
indieate this is an exeeption. If cr is too small (below 0.025) eonvergenee is slower as illustrated 
in the plot of the number of iterations to eonvergenee versus both a and a shown in Figure |4| 
Interestingly, even for this very simple ease the multiseeant methods are significantly faster. 
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FIG. 3: Plot of the convergence for models 1-5 using the multisecant update based on Broyden's first 
method (MSGB) and second method (MSBB), compared to Broyden's second method (BB) and Broyden's 
first method (GB). In model 5 the only algorithm to converge is MSBB. 



For the slightly more complicated Model 2, both multisecant methods converge rapidly, whereas 
the BB method converges more slowly and the GB method is worst by a significant margin. The 
principal difference between Models 1 and 2 is that in Model 1 there are large changes during the 
iterations both within the muffin-tins, as well as for the plane waves, whereas in Model 2 almost 
all the changes are in the plane waves. This supports the rule-of-thumb discussed earlier that one 
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FIG. 4: Number of iterations to convergence for MSGB on Model 1 as a function of the value of the 
algorithm parameters a and a . The surface plot has been fitted by a polynomial to obtain smooth contours 
representing the behavior. 

should make the muffin tins as large as possible without overlapping. 

With Model 3 the multisecant methods significantly outperform the classical secant methods. 
For bulk silicon much of the covalent bonding lies in the interstitial region. We conjecture, therefore, 
that the improvement is due to the improved step direction and size for the multisecant methods 
that allow these methods to handle the greater variations of the Kohn -Sham mapping for this 
basis set. 

The same trend continues with both Model 4 and Model 5 to the extent that BB and GB only 
converge for "good" values of a (which have to be found by trial and error) and in many cases 
diverge. For the hardest problem we report here. Model 5, only the MSBB method converged. If 
one added a line search the other methods would probably converge albeit less rapidly and with a 
many more SCF evaluations. 

The (J parameter in the MSBB update gives one direct control over the size of the steps, which 
is an important feature for models with strong variations. The control of steps is less immediate 
for the MSGB update and involves a more sensitive coupling of the regularization parameter a and 
the step size parameter a. This is illustrated by the greater variance in performance of the MSGB 
update versus MSBB for models 1, 2, 4, and 5 shown in Table |l| 
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V. DISCUSSION 

To summarize the main points of this work: 

• We argue that for DFT problems, where many physically interesting models result in non- 
contractive SCF mappings, one should consider the information from previous points of the 
SCF cycle more as samples of a highcr-dinicnsional space than as part of a deterministic 
path. As a consequence multisecant methods are better than sequential secant updates, as 
born out in the results. 

• There is a fundamental difference between methods based upon Broyden's first (GB) and 
second (BB) methods in terms of the space they operate in. The second method is more 
robust and handles poorly constructed, (nearly) ill-posed problems better - in general these 
are the more interesting physical problems. 

• Scaling, regularization and preconditioning have a significant impact on algorithm perfor- 
mance. Moreover, regularization acts simultaneously to reduce instabilities due both to 
linear dependencies as well as to deficiencies in the model. 

• Controlling the step size (T„ along the direction about which no information is available is 
critical. For difficult problems, this step should in general be smaller than for easy problems. 

• The multisecant method based upon Broyden's second formulation (MSBB) with appropriate 
safeguards simply and quickly solves problems which may defeat a novice, sometimes even 
an expert. 

The method we have detailed (MSBB) is robust and has been part of the main Wien2k distribution 
(7.3) for some months without any apparent problems. Even in the hands of an experienced user 
for complicated problems such as LDA-I-U we have been told of cases where the MSBB version 
is three times faster than the earlier BB code. The default values of a = 10~^ and R = 0.1 will 
be approximately correct for a pseudopotential code where preconditioning the variables is not 
necessary though there are strong variations Kohn-Sham mapping. We have not attempted to 
impliment the MSBB algorithm for a pseudopotential code but see no reason why it should not 
work at least as well. One can of course adjust these parameters to improve a single problem, but 
we recommend values that perhaps are slightly slower in a few cases, but more robust for a wide 
variety of problems. There may also be ways to stabilize MSGB so that it could possibly work 
better for pseudopotential codes where preconditioning is easier. 



29 



Some additional comments are appropriate about the role of the term in the regularization. As 
mentioned earlier, we are using this simultaneously in three ways, firstly as a standard regulariza- 
tion technique to avoid ill-conditioning associated with near linear dependence of the columns of 
Yn, secondly as a Levenberg-Marquardt-type strategy to rotate the step and thirdly in a standard 
Wiener filter sense to account for model uncertainty. The regularization parameter can be con- 
sidered to scale proportionally to the noise or uncertainty in the secant equations. Far from the 
solution the quasi-Newton step may not be appropriate, suggesting that one should use a larger 
regularization. Similarly, near the solution if the quasi-Newton step is accurate, it will yield faster 
rates of convergence, in which case one would choose a smaller regularization. While one could 
dynamically adjust the regularization parameter, for our numerical experiments we choose a rela- 
tively large fixed value of a ( 10~^). This, in our experience, yields adequate overall convergence 
and better convergence in the "dangerous" early stages of the iterations. 

Wc emphasize once again the link between convergence of the mixing process and the functional 
properties of the underlying Kohn-Sham mapping. A poorly constructed problem will in most cases 
converge much more slowly than a well constructed one. This may be a consequence of short-cuts 
in the DFT calculation, e.g. too few fc-points or numerical errors in an iterative diagonalization, or 
it can be due to a poorly constructed Hamiltonian or perhaps density functional. For the general 
user poor convergence should be taken as a suggestion that the model of the physics may not have 
been properly constructed. 

The fact that we that we obtain improvement under the assumption that most models of 
physical interest do not lead to contractive, or more generally monotone SCF mappings raises 
some questions. It is well established that current density functionals are inexact descriptions of 
the physics, but the exact analytic properties of many physical systems are unknown. In particular, 
for many systems it is not known whether the SCF operator is monotone, let alone that it has 
fixed points, although it is hard to conceive of an experimentally observable equilibrium structure 
that does not have fixed pionts. An interesting question to raise is whether the SCF operator is 
monotonic with the "true" density functional that correctly describes the physics. Since in many 
cases the effective potential Vp has no closed form, it is not known whether many of these theoretical 
properties are verifiable. It is tempting to infer analytic properties from numerical experiments - 
and we have made numerical progress by doing just this - but one cannot on numerical evidence 
alone determine the extent to which numerical behavior is indicative of the true nature of the 
physical system. As a final speculation, we raise the question of whether the character of the 
SCF mapping can be experimentally measured, or whether this type of behavior is a mathematical 
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anomaly resulting from being much further away from equilibrium than any feasible experimental 
system will ever be. 

There are several directions of research with regard to algorithms. Firstly, the heuristics for 
adjusting the step size (T„ need to be put on firm mathematical footing. This would accompany a 
study of the asymptotic behavior of the algorithm and is the subject of future research. While the 



analysis of Eq.(III.5 ) has attractive interpretations in terms of nearest points in the range and space 
orthogonal to the prior data, the notion of "nearest" is with respect to the usual Euclidean (L^) 
norm, which is biased towards outliers. One could consider the development of algorithms based on 
weighted norms, or even non-Euclidean prox mappings as opposed to those detailed in Subsection 
IIDI The considered b>EEllQl 

is in the spirit of weighted norms. Other areas for improvement 
could be found in the initialization of the iterations. We used the Pratt step, however one could 
use information from a previous SCF iteration. 

Finally, while we have used some physics in helping to design the algorithm, there may be more 
that could be exploited. We find particularly appealing the observation discussed at the beginning 



of Section III that the density appears to be separable into distinct subsets. One might envision 
tailoring algorithms to exploit this property. For instance, one could iterate on the components 
of the density associated with the muffin-tins, while holding the interstitial electron density fixed. 
Alternatively, one could iterate on the sp-electron density holding the d-electron density fixed, or 
one could iterate on other observables such as the spin associated with a particular atom. Such 
an approach might allow one to isolate irregular variables within the SCF mapping and design 
algorithms accordingly. This general approach is known as operator splitting about which there is 
a vast literature, (see, for instance^^ and references therein). This would allow one to isolate the 
analytical properties of the SCF operator and work more directly with specific physical quantities. 
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